System and method for generating an identification signal for electronic devices

ABSTRACT

A system and method for creating a ring tone for an electronic device takes as input a phrase sung in a human voice and transforms it into a control signal controlling, for example, a ringer on a cellular telephone. Time-varying features of the input signal are analyzed to segment the signal into a set of discrete notes and to assign to each note a chromatic pitch value. The set of note start and stop times and pitches is then translated into a format suitable for controlling the device.

FIELD OF THE INVENTION

[0001] This invention relates generally to personal electronic devices and more particularly to generating personalized ring tones for personal electronic devices such as cellular telephones.

BACKGROUND OF THE INVENTION

[0002] It is desirable to personalize the presentation of portable electronic appliances, either to distinguish one appliance from other similar appliances where they might otherwise be confused, or simply to conform the presentation of an appliance to its owner's personal preferences. Current mobile telephones, for example, provide options for customizing the ring tone sequence that let users choose a sequence that is pleasant to the ear, suits their style, and is unique to their personality. The proliferation of affordable mobile handsets and services has created an enormous market opportunity for wireless entertainment and voice-based communication applications, a consumer base that is an order of magnitude larger than the personal computer user base.

[0003] Although pre-existing sequences of ring tones can be downloaded from a variety of Web sites, many users wish to create a unique ring tone sequence. The current applications for creating customized ring tone sequences are limited by the fact that people with musical expertise must create them and the users must have Internet access (in addition to the mobile handset).

[0004] The current methods for generating, sending, and receiving ring tone sequences involve four basic functions. The first function is the creation of the ring tone sequence. The second function is the formatting of the ring tone sequence for delivery. The third function is the delivery of the ring tone sequence to a particular handset. The fourth function is the playback of the ring tone sequence on the handset. Current methodologies are limited by the lack of available options in the first, creation function. All methodologies must follow network protocols and standards for functions two and three for the successful completion of any custom ring tone system. Functions two and three could be collectively referred to as delivery, but they are distinctly different processes. The fourth function depends on the hardware capabilities of the specific handset, which vary by manufacturer and by the country in which the handset is sold.

[0005] Current methods for the creation of ring tone sequences involve some level of musical expertise. The most common way to purchase a custom ring tone sequence is to have someone compose or duplicate a popular song, post the file to a commercial Web site service, preview the ring tone sequence, then purchase the selection. This is currently a very popular method, but is limited by the requirement of an Internet connection to preview the ring tone sequences. It also requires the musical expertise of someone else to generate the files.

[0006] Another common system for the creation of ring tone sequences is to key a sequence of codes and symbols manually, directly into the handset. Typically, these sequences are available on various Internet sites and user forums. Again, this is limited to users with an Internet connection and the diligence to find these sequences and input them properly.

[0007] A third method involves using tools available through commercial services and handset manufacturer Web sites that allow the user to generate a ring tone sequence by creating notes and sounds in a composition setting such as a score of music. This involves even greater musical expertise because it is essentially composing songs note by note. It also involves the use of an Internet connection.

[0008] Another method of creating a ring tone is to translate recorded music into a sequence of tones. A number of problems arise when attempting to translate recorded music into a ring tone sequence for an electronic device. The translation process generally requires segmentation and pitch determination. Segmentation is the process of determining the beginning and the end of a note. Prior art systems for segmenting notes in recordings of music rely on various techniques to determine note beginning points and end points. Techniques for segmenting notes include energy-based segmentation methods as disclosed in L. Rabiner and R. Schafer, “Digital Processing of Speech Signals,” Prentice Hall: 1978, pp. 120-135, and L. Rabiner and B. H. Juang, “Fundamentals of Speech Recognition,” Prentice Hall: New Jersey, 1993, pp. 143-149; voicing probability-based segmentation methods as disclosed in L. Rabiner and R. Schafer, “Digital Processing of Speech Signals,” Prentice Hall: 1978, pp. 135-139, 156, 372-373, and T. F. Quatieri, “Discrete-Time Speech Signal Processing: Principles and Practice,” Prentice Hall: New Jersey, 2002, pp. 516-519; and statistical methods based on stationarity measures or Hidden Markov models as disclosed in C. Raphael, “Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, 1999, pp. 360-370. Once the note beginning and end points have been determined, the pitch of the note over its entire duration must be determined. A variety of techniques for estimating the pitch of an audio signal are available, including autocorrelation techniques, cepstral techniques, wavelet techniques, and statistical techniques, as disclosed in L. Rabiner and R. Schafer, “Digital Processing of Speech Signals,” Prentice Hall: 1978, pp. 135-141, 150-161, 372-378; T. F. Quatieri, “Discrete-Time Speech Signal Processing,” Prentice Hall: New Jersey, 2002, pp. 504-516; and C. Raphael, “Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, 1999, pp. 360-370. Using any of these techniques, the pitch can be measured at several times throughout the duration of a note. The resulting sequence of pitch estimates may then be used to assign a single pitch (frequency) to the note; however, pitch estimates vary considerably over the duration of a note. This is true of most acoustic instruments and especially of the human voice, which is characterized by multiple harmonics, vibrato, aspiration, and other qualities that make the assignment of a single pitch quite difficult.

[0009] It is desirable to have a system and method for creating a unique ring tone sequence for a personal electronic device that does not require musical expertise or programming tasks.

[0010] It is an object of the present invention to provide a system and apparatus to transform an audio recording into a sequence of discrete notes and to assign to each note a duration and frequency from a set of predetermined durations and frequencies.

[0011] It is another object of the present invention to provide a system and apparatus for creating custom ring tone sequences by transforming a person's singing, or any received song that has been sung, into a ring tone sequence for delivery and use on a mobile handset.

SUMMARY OF THE INVENTION

[0012] The problems of creating an individualized identification signal for electronic devices are solved by the present invention of a system and method for generating a ring tone sequence from a monophonic audio input.

[0013] The present invention is a digital signal processing system for transforming monophonic audio input into a resulting representation suitable for creating a ring tone sequence for a mobile device. It includes a method for estimating note start times and durations and a method for assigning a chromatic pitch to each note.

[0014] A data stream module samples and digitizes an analog vocalized signal, divides the digitized samples into segments called frames, and stores the digital samples for a frame into a buffer.

[0015] A primary feature estimation module analyzes each buffered frame of digitized samples to produce a set of parameters that represent salient features of the voice production mechanism. The analysis is the same for each frame. The parameters produced by the preferred embodiment are a series of cepstral coefficients, a fundamental frequency, a voicing probability and an energy measure.

[0016] A secondary feature estimation module produces a representation of the average change of the parameters produced by the primary feature estimation module.

[0017] A tertiary feature estimation module creates ordinal vectors that encode the number of frames, both forward and backward, in which the direction of change encoded by the secondary feature estimation module remains the same.

[0018] Using the primary, secondary and tertiary features, a two-phase segmentation module produces estimates of the starting and ending frames for each segment. Each segment corresponds to a note. The first phase of the two-phase segmentation module categorizes the frames into regions of upward energy followed by downward energy by using the tertiary feature vectors. The second phase of the two-phase segmentation module looks for significant changes in the primary and secondary features over the categorized frames of successive upward and downward energy to determine starting and ending frames for each segment.

[0019] Finally, after the segments have been determined, a pitch estimation module provides an estimate of each note's pitch based primarily on the fundamental frequency as determined by the primary feature estimation module.

[0020] A ring tone sequence generation module uses each note's start time, duration, end time and pitch to generate a representation adequate for generating a ringing tone sequence on a mobile device. In the preferred embodiment, the ring tone sequence generation module produces output written in accordance with the smart messaging specification (SMS) ringing tone syntax, a part of the Global System for Mobile Communications (GSM) standard. The output may also be in Nokia Ring Tone Transfer Language; Enhanced Messaging Service (EMS), which is a standard developed by the Third Generation Partnership Project (3GPP); iMelody, which is a standard for defining sounds within EMS; Multimedia Messaging Service (MMS), which is standardized by 3GPP; WAV, which is a format for storing sound files supported by Microsoft Corporation and by IBM Corporation; or musical instrument digital interface (MIDI), which is the standard adopted by the electronic music industry. These outputs are suitable for being transmitted via the smart messaging specification.

[0021] The present invention, together with the above and other advantages, may best be understood from the following detailed description of the embodiments of the invention illustrated in the drawings, wherein:

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] FIG. 1 is a block diagram of a telephone-based song processing and transmission system according to principles of the invention;

[0023] FIG. 2A is a block diagram of a ring tone sequence subsystem of FIG. 1;

[0024] FIG. 2B is a block diagram of the primary feature parameters for a given frame whose values are generated by the primary feature estimation module of FIG. 2A;

[0025] FIG. 2C is a block diagram of the secondary feature parameters for a given frame whose values are generated by the secondary feature estimation module of FIG. 2A;

[0026] FIG. 2D is a block diagram of the tertiary feature parameters for a given frame whose values are generated by the tertiary feature estimation module of FIG. 2A;

[0027] FIG. 3 is a block diagram of the two-phase segmentation module in accordance with the present invention;

[0028] FIG. 4 is a part block diagram, part flow diagram of the operation of the pitch assignment module, including the intranote pitch assignment subsystem and the internote pitch assignment subsystem of FIG. 1;

[0029] FIG. 5 is a part block diagram, part flow diagram of the operation of the intranote pitch assignment subsystem of FIG. 4;

[0030] FIG. 6 is a part block diagram, part flow diagram of the operation of the internote pitch assignment subsystem of FIG. 4; and

[0031] FIG. 7 is a block diagram of a networked computer implementation of the system of FIG. 1.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0032] FIG. 1 is a block diagram of a system 10 suitable for accepting as input a monophonic audio signal. In a first alternative embodiment of the invention, the monophonic audio signal is a vocalized song. The system 10 provides an output of information for programming a corresponding ring tone for mobile telephones according to principles of the present invention. The system 10 has a telephony (or mobile) call handler 50, a ring tone sequence application 40 that transforms vocal input in accordance with the present invention, and an SMS handler 30. Input signal 5 from a source 2 is received at the call handler 50 for voice capture. The input signal is of limited duration, typically lasting between 5 and 60 seconds, although signals of shorter or longer duration are possible. The voice signal is then digitized and transmitted to the ring tone sequence subsystem 40. While the input shown here is an analog receiver such as an analog telephone, the input could also be received from an analog-to-digital signal transducer. Further, instead of receiving an input signal over a telephone network, the input signal could instead be received at a kiosk or over the Internet.

[0033] The ring tone sequence subsystem 40 analyzes the digitized voice signal 15, represents it by salient parameters, segments the signal, estimates a pitch for each segment, and produces a note-based sequence 25. The SMS handler 30 processes the note-based sequence 25 and transmits an SMS containing the ring tone representation 35 of discrete tones to a portable device 55 having the capability of “ringing,” such as a cellular telephone. The ring tone representation results in an output from the “ringing” device of a series of tones recognizable to the human ear as a translation of the vocal input.

[0034] Ring Tone Sequence Subsystem

[0035] FIG. 2A is a block diagram of the ring tone sequence subsystem 40 of FIG. 1. FIG. 2A illustrates in greater detail the main components of the ring tone sequence subsystem 40 and the component interconnections. The ring tone sequence subsystem 40 has a data stream module 100, a primary feature estimation module 120, a secondary feature estimation module 130, a tertiary feature estimation module 140, a segmentation module 300 comprising a first-phase segmentation module 150 and a second-phase segmentation module 160, an intranote pitch assignment subsystem 170, and an internote pitch assignment subsystem 180.

[0036] In the data stream module 100, signal preprocessing is first applied, as known in the art, to facilitate encoding of the input signal. As is customary in the art, the digitized acoustic signal, x, is next divided into overlapping frames. The framing of the digital signal is characterized by two values: the frame rate in Hz (or the frame increment in seconds, which is simply the inverse of the frame rate) and the frame width in seconds. In a preferred embodiment of the invention, the acoustic signal is sampled at 8,000 Hz and is enframed using a frame rate of 100 Hz and a frame width of 36.4 milliseconds. In a preferred embodiment, the separation of the input signal into frames is accomplished using a circular buffer having a size of 291 sample storage slots. In other embodiments the input signal buffer may be a linear buffer or other data structure. The framed signal 115 is output to the primary feature estimation module 120.
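
For illustration only, the framing described above can be sketched in a few lines of Python; the function name and the use of NumPy are assumptions of this sketch, not part of the described embodiment:

```python
import numpy as np

def enframe(x, fs=8000, frame_rate=100, frame_width_s=0.0364):
    """Divide a 1-D signal into overlapping frames.

    With fs = 8,000 Hz, frame_rate = 100 Hz and a 36.4 ms width, the
    frame increment is 80 samples and each frame holds 291 samples,
    matching the buffer size given in the text.
    """
    hop = int(round(fs / frame_rate))           # 80-sample increment
    width = int(round(fs * frame_width_s))      # 291-sample width
    n_frames = 1 + (len(x) - width) // hop
    return np.stack([x[i * hop : i * hop + width] for i in range(n_frames)])

# Two seconds of a 220 Hz tone yields 197 overlapping frames.
fs = 8000
t = np.arange(2 * fs) / fs
print(enframe(np.sin(2 * np.pi * 220 * t)).shape)   # (197, 291)
```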

[0037] Primary Feature Estimation Module

[0038] The primary feature estimation module 120, shown in FIG. 2A, produces a set of time-varying primary features 125 for each frame of the digitized input signal 15. FIG. 2B depicts a “primary data structure” 125A used to store the primary features 125 for one frame of the digitized input signal 15. The primary features generated by the primary feature estimation module 120 for each frame and stored in the primary data structure 125A are:

[0039] time-domain energy measure, E, 226

[0040] fundamental frequency, f₀, 222

[0041] cepstral coefficients, {c₀, c₁}, 220

[0042] cepstral-domain energy measure, e, 228,

[0043] voicing probability v, 224

[0044] The primary features are extracted as follows. The input is the digitized signal, x, which is a discrete-time signal that represents an underlying continuous waveform produced by the voice or another instrument capable of producing an acoustic signal. The primary features are extracted from each frame. Let x[n] represent the value of the signal at sample n. The time at sample n relative to the beginning of the signal, n=0, is n/f_s, where f_s is the sampling frequency in Hz. Let F(i) represent the index set of all n in frame i, and N_F the number of samples in each frame.

[0045] The time-domain energy measure is extracted from frame i according to the formula

$$E[i] = \frac{1}{N_F} \sum_{m \in F(i)} \left[ w(m - i)\,\left( x[m] - \bar{x} \right) \right]^2 \qquad (1)$$

[0046] where x̄ is the mean of x[m] for all m ∈ F(i) and w is a window function. Equation 1 states that the time-domain energy measure 226 is extracted by multiplying the mean-removed signal by the window, summing the square of the result, and normalizing by the number of samples in the frame. The window w reaches a maximum at the center of the frame and a minimum at the beginning and end of the frame; that is, the window function is unimodal. The preferred embodiment uses a Hamming window. Other types of windows that may be used include a Hanning window, a Kaiser window, a Blackman window, a Bartlett window and a rectangular window.
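
A minimal Python sketch of Equation 1, assuming a Hamming window as in the preferred embodiment (the helper name is hypothetical):

```python
import numpy as np

def time_domain_energy(frame):
    """Time-domain energy of one frame per Equation 1: window the
    mean-removed samples, square, sum, normalize by frame length."""
    frame = np.asarray(frame, dtype=float)
    w = np.hamming(len(frame))          # unimodal window, peak at center
    centered = frame - frame.mean()     # remove the frame mean
    return np.sum((w * centered) ** 2) / len(frame)
```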

[0047] The fundamental frequency 222 is estimated by looking for periodicity in x. The fundamental frequency at frame i is calculated by estimating the longest period in frame i, T₀[i], and taking its inverse,

$$f_0[i] = \frac{1}{T_0[i]} \qquad (2)$$

[0048] In the preferred embodiment, f₀[i] is calculated using frequency-domain techniques. Pitch detection techniques are well known in the art and are described, for example, in L. Rabiner and R. Schafer, “Digital Processing of Speech Signals,” Prentice Hall: 1978, pp. 135-141, 150-161, 372-378, and T. F. Quatieri, “Discrete-Time Speech Signal Processing,” Prentice Hall: New Jersey, 2002, pp. 504-516. The cepstral coefficients 220 are extracted using the complex cepstrum by computing the inverse discrete Fourier transform of the complex natural logarithm of the short-time discrete Fourier transform of the windowed signal. The short-time discrete Fourier transform is computed using techniques customary in the prior art. Let X[i,k] be the discrete Fourier transform of the windowed signal, which is computed according to the formula

$$X[i,k] = \sum_{m \in F'(i)} w(m - i)\,\left( x[m] - \bar{x} \right) e^{-j 2\pi m k / N} \qquad (3)$$

[0049] where N is the size of the discrete Fourier transform and F′(i) is F(i) with N − N_F zeros added.

[0050] The cepstral coefficients are computed from the inverse discrete Fourier transform of the natural logarithm of X[i,k] as

$$c_m[i] = \sum_{k=0}^{N-1} \log X[i,k]\, e^{j 2\pi m k / N} \qquad (4)$$

[0051] where

$$\log X[i,k] = \log \lvert X[i,k] \rvert + j\,\mathrm{Angle}\left( X[i,k] \right) \qquad (5)$$

[0052] and where Angle(X[i,k]) is the phase angle of X[i,k]. In the preferred embodiment, the primary features include the first two cepstral coefficients, i.e., c_m[i] for m ∈ {0, 1}. Cepstral coefficients, derived from the inverse Fourier transform of the log magnitude spectrum generated from a short-time Fourier transform of one frame of the input signal, are well known in the art and are described, for example, in L. Rabiner and B. H. Juang, “Fundamentals of Speech Recognition,” Prentice Hall: New Jersey, 1993, pp. 143-149, which is hereby incorporated by reference as background information.
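
The complex-cepstrum computation of Equations 3 through 5 might be sketched as follows; this is an illustrative simplification (NumPy's FFT routines, a fixed 512-point transform, and the epsilon guarding the logarithm are all assumptions of the sketch):

```python
import numpy as np

def cepstral_coefficients(frame, n_fft=512, n_coeffs=2):
    """First complex-cepstrum coefficients of one frame (Eqs. 3-5).

    The phase term of Eq. 5 is handled here with np.unwrap, and
    NumPy's ifft supplies a 1/N factor not written in Eq. 4; both
    are simplifications of this sketch.
    """
    frame = np.asarray(frame, dtype=float)
    w = np.hamming(len(frame))
    X = np.fft.fft(w * (frame - frame.mean()), n_fft)            # Eq. 3
    log_X = np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))  # Eq. 5
    c = np.fft.ifft(log_X).real                                  # Eq. 4
    return c[:n_coeffs]                                          # c0 and c1
```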

[0053] The cepstral-domain energy measure 228 is extracted according to the formula

$$e[i] = \frac{c_0[i] - \bar{c}_0}{\max_{i'} \left( c_0[i'] \right)} \qquad (6)$$

[0054] The cepstral-domain energy measure represents the short-time cepstral gain with the mean value removed, normalized by the maximum gain over all frames.

[0055] The voicing probability measure 224 is defined as the point between the voiced and unvoiced portions of the frequency spectrum for one frame of the signal. A voiced signal is defined as a signal that contains only harmonically related spectral components, whereas an unvoiced signal does not contain harmonically related spectral components and can be modeled as filtered noise. In the preferred embodiment, if v=1 the frame of the signal is purely voiced; if v=0, the frame of the signal is purely unvoiced.

[0056] Secondary Feature Estimation Module

[0057] The secondary feature estimation module 130, shown in FIG. 2A, produces a set of time-varying secondary features 135 based on the primary features 125. FIG. 2C depicts a “secondary data structure” 135A used to store the secondary features 135 for one frame of the digitized input signal 15. The secondary feature estimation module 130 generates secondary features by taking short-term averages of the primary features 125 output from the primary feature estimation module 120. Short-term averages are typically taken over 2-10 frames. In a preferred embodiment, short-term averages are computed over three consecutive frames, as illustrated in the sketch following this list. Secondary features generated for each frame and stored in the secondary data structure 135A are:

[0058] short-term average change in time-domain energy E, $\overline{\Delta E}$, 242

[0059] short-term average change in fundamental frequency f₀, $\overline{\Delta f_0}$, 236

[0060] short-term average change in cepstral coefficient c₁, $\overline{\Delta c_1}$, 232

[0061] short-term average change in cepstral-domain energy e, $\overline{\Delta e}$, 240
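
As a rough illustration of the three-frame averaging described above, the following hypothetical Python helper computes the short-term average change of any primary-feature track:

```python
import numpy as np

def short_term_avg_change(feature, n_avg=3):
    """Short-term average of the frame-to-frame change of one primary
    feature track (one value per frame; edges taper toward zero)."""
    feature = np.asarray(feature, dtype=float)
    delta = np.diff(feature, prepend=feature[:1])    # per-frame change
    kernel = np.ones(n_avg) / n_avg
    return np.convolve(delta, kernel, mode="same")   # 3-frame average
```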

[0062] Tertiary Feature Estimation Module

[0063] The tertiary feature estimation module 140, shown in FIG. 2A, produces a set of time-varying tertiary features 145 based on two of the secondary features 135. FIG. 2D depicts a “tertiary data structure” 145A used initially to store the tertiary features 145 for one frame of the digitized input signal 15. The tertiary feature estimation module 140 generates tertiary features that represent the number of consecutive frames over which a given secondary feature 135 has kept the same sign, that is, over which the underlying primary feature has changed in the same direction. Tertiary features generated for each frame and stored in the tertiary data structure 145A are:

[0064] count of consecutive upward short-term average change in cepstral-domain energy e, $N(\overline{\Delta e} > 0)$, 244

[0065] count of consecutive downward short-term average change in cepstral-domain energy e, $N(\overline{\Delta e} < 0)$, 246

[0066] count of consecutive upward short-term average change in fundamental frequency f₀, $N(\overline{\Delta f_0} > 0)$, 248

[0067] count of consecutive downward short-term average change in fundamental frequency f₀, $N(\overline{\Delta f_0} < 0)$, 250

[0068] In the preferred embodiment, counters N(a) are provided for each frame for each of the four tertiary features. A counter is reset whenever its argument a is false. The argument a depends on both the frame and the particular feature being counted. For example, for the count of upward change in f₀, the argument is false at a given frame when the short-term average change in f₀ at that frame is less than zero.
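
The run-length counters might be sketched as follows; the reset-on-false behavior matches the description above, while the helper name and array representation are assumptions:

```python
import numpy as np

def directional_run_counts(sec_feature):
    """Per-frame counts of consecutive upward and downward change.

    sec_feature: one secondary feature per frame, e.g. the short-term
    average change in e or f0. Each counter increments while its sign
    condition holds and resets to zero as soon as it fails.
    """
    n = len(sec_feature)
    n_up = np.zeros(n, dtype=int)
    n_down = np.zeros(n, dtype=int)
    for i, v in enumerate(sec_feature):
        if v > 0:
            n_up[i] = (n_up[i - 1] if i > 0 else 0) + 1
        if v < 0:
            n_down[i] = (n_down[i - 1] if i > 0 else 0) + 1
    return n_up, n_down
```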

[0069] Two-Phase Segmentation Module

[0070] FIG. 3 is a block diagram of the two-phase segmentation module 300, including the first-phase segmentation module 150 and the second-phase segmentation module 160 shown in FIG. 2A. The first-phase segmentation module 150 groups successive frames into regions based on two of the tertiary features 145. A region is a set of frames in which the change in energy increases, immediately followed by frames in which the change in energy decreases. Specifically, the tertiary features $N(\overline{\Delta e} > 0)$, 244, and $N(\overline{\Delta e} < 0)$, 246, are used to group successive frames into regions. A region, in order to be valid, must have at least a minimum number of frames, for example 10 frames. A region is defined in this way because a valid start frame, i.e. a note start, is a transitory event occurring while energy is in flux. That is, a note does not start when the energy is flat, or when it is decreasing, or when it is continually increasing. A note start is generally characterized by an increase in energy followed by an immediate decrease in the change in energy. Typically there are 4-12 frames of increasing energy followed by 10-35 frames of decreasing energy.

[0071] For each region determined by the first-phase segmentation module 150, a candidate note start frame is estimated. Within the region, the candidate start frame is determined as the last frame within the region in which the tertiary feature $N(\overline{\Delta e} > 0)$, 244, contains a non-zero count. The second-phase segmentation module 160 determines which regions contain valid note start frames. Valid note start frames are determined by selecting all regions estimated by the first-phase segmentation module 150 that contain significant correlated change within the region. Each region starts when a given frame of $N(\overline{\Delta e} > 0)$, 244, contains a non-zero count and the previous frame contains a zero.
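
A possible sketch of the first-phase grouping, built on run counters like those above; the 10-frame minimum follows the example value in the text, while the exact region bookkeeping, which here uses only the upward counter, is an illustrative guess:

```python
def candidate_regions(n_up, min_frames=10):
    """First-phase grouping: a region spans from one restart of the
    upward-energy counter to the frame before the next restart, so it
    covers a rise in energy change and the subsequent fall. The
    candidate note start is the last frame of the region whose upward
    counter is still non-zero."""
    starts = [i for i in range(1, len(n_up))
              if n_up[i] > 0 and n_up[i - 1] == 0]
    regions = []
    for a, b in zip(starts, starts[1:]):
        if b - a >= min_frames:                   # example minimum length
            cand = max(j for j in range(a, b) if n_up[j] > 0)
            regions.append((a, b - 1, cand))
    return regions   # list of (first_frame, last_frame, candidate_start)
```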

[0072] The second-phase segmentation module 160 uses three threshold-based criteria for determining which regions and their corresponding start frames actually represent starting note boundaries. The first criterion is based on the cepstral-domain energy measure e, a primary feature. Each frame within a valid region, as determined by the first-phase segmentation process, is evaluated. A frame within a valid region is marked if its cepstral-domain energy is greater than a cepstral-domain energy threshold and that of the previous frame is less than the threshold. An example value of the cepstral-domain energy threshold is 0.0001. If a valid region has any marked frames, the corresponding start frame based on $N(\overline{\Delta e} > 0)$ is chosen as a start frame representing an actual note boundary.

[0073] The second and third criteria use parameters to select whether a frame within a valid region R is marked. The parameter used by the second criterion, referred to herein as the fundamental frequency range and denoted by Range(f₀[i], R), is calculated according to

$$\mathrm{Range}\left( f_0[i], R \right) = \max_{i \in R}\left( f_0[i] \right) - \min_{i \in R}\left( f_0[i] \right)$$

[0074] An example fundamental frequency range threshold is 0.45 MIDI note numbers. Equation 7, below, provides the conversion from hertz to MIDI note numbers.

[0075] The parameter used by the third criterion, referred to herein as the energy range and denoted by Range(e[i], R), is calculated similarly. An example value of the energy range threshold is 0.2.

[0076] The candidate note start frame within a valid region is chosen as a start frame representing an actual note boundary if the cepstral-domain energy criterion is met, or if both the fundamental frequency range and the energy range exceed their thresholds.

[0077] For each start frame resulting from the three criteria described above, a corresponding stop frame of the note boundary is found by selecting the first frame occurring after the start frame in which the primary feature e drops below the cepstral-domain energy threshold. In the preferred embodiment, if e does not drop below the cepstral-domain energy threshold on a frame prior to the next start frame, the stop frame is taken to be a predefined number of frames before the next start frame. In the preferred embodiment of the invention, this stop frame is between 1 and 10 frames before the next start frame.
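
The three criteria and the stop-frame rule might look as follows in outline; the example thresholds (0.0001, 0.45 MIDI note numbers, 0.2) are taken from the text, and everything else, including the five-frame backoff, is an assumption of the sketch:

```python
def validate_region(e, f0_midi, first, last,
                    e_thresh=0.0001, f0_range_thresh=0.45,
                    e_range_thresh=0.2):
    """Second-phase check: does this region hold a real note start?

    e: cepstral-domain energy per frame; f0_midi: fundamental
    frequency per frame, already converted to MIDI numbers (Eq. 7).
    """
    frames = range(first, last + 1)
    # Criterion 1: e crosses its threshold from below inside the region.
    crossed = any(i > 0 and e[i] > e_thresh and e[i - 1] <= e_thresh
                  for i in frames)
    # Criteria 2 and 3: ranges of f0 and e over the region, Range(., R).
    f0_vals = [f0_midi[i] for i in frames]
    e_vals = [e[i] for i in frames]
    f0_range = max(f0_vals) - min(f0_vals)
    e_range = max(e_vals) - min(e_vals)
    return crossed or (f0_range > f0_range_thresh
                       and e_range > e_range_thresh)

def find_stop_frame(e, start, next_start, e_thresh=0.0001, backoff=5):
    """First frame after the start where e falls below the threshold;
    otherwise a few frames (1-10 in the preferred embodiment) before
    the next start. The backoff of 5 is an arbitrary choice here."""
    for i in range(start + 1, next_start):
        if e[i] < e_thresh:
            return i
    return max(start + 1, next_start - backoff)
```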

[0078] The output of the Two-Phase Segmentation Module is a list of note start and stop frames.

[0079] In the preferred embodiment, a segmentation post-processor 166 is used to verify the list of note start and stop frames. For each note, which consists of all frames between a pair of start and stop frames, three values are calculated: the average voicing probability v, the average short-time energy e, and the average fundamental frequency. These values are used to check whether the corresponding note should be removed from the list. For example, in the preferred embodiment, if the average voicing probability for a note is less than 0.12, the note is classified as a “breath” sound or “noise” and is removed from the list, since it is not considered a “musical” note. Also, for example, in the preferred embodiment, if the average energy e is less than 0.0005, then the note is considered “non-musical” as well and is classified as “noise” or an “un-intentional sound”.

[0080] Pitch Assignment Module

[0081] FIG. 4 shows the process of the pitch assignment module, including the intranote pitch assignment subsystem 170 and the internote pitch assignment subsystem 180 of FIG. 1. The Pitch Assignment Module accepts as input the output of the Two-Phase Segmentation Module and the Primary Feature Estimation Module, and assigns a single pitch to each note detected by the Two-Phase Segmentation Module, step 190. This output is first sent to the intranote pitch assignment subsystem, step 200. Output from the intranote pitch assignment subsystem is then sent to the internote pitch assignment subsystem, step 205. The Intranote Pitch Assignment Subsystem 170 and the Internote Pitch Assignment Subsystem 180 determine the assigned pitch for each note in the score. The major difference between these two subsystems is that the Intranote Pitch Assignment Subsystem does not use contextual information (i.e., features corresponding to prior and future notes) to assign MIDI note numbers to notes, whereas the Internote Pitch Assignment Subsystem does make use of contextual information from other notes in the score. The output of the pitch assignment module is a final score data structure, step 210. The score data structure includes the starting frame number, the ending frame number, and the assigned pitch for each note in the sequence. The assigned pitch for each note is an integer between 32 and 83 that corresponds to the Musical Instrument Digital Interface (MIDI) note number.

[0082] The set of primary features between and including the starting and ending frame numbers is used to determine the assigned pitch for each note as follows. Let S_j denote the set of frame indices between and including the starting and ending frames for note j. The set of fundamental frequency estimates within note j is denoted by {f₀[i], ∀i ∈ S_j}.

[0083] FIG. 5 shows the operation of the intranote pitch assignment subsystem 170. The Intranote Pitch Assignment Subsystem consists of four processing stages: the Energy Thresholding Stage 201, the Voicing Thresholding Stage 202, the Statistical Processing Stage 203, and the Pitch Quantization Stage 204. The Energy Thresholding Stage removes from S_j fundamental frequency estimates with corresponding time-domain energies less than a specified energy threshold (for example, 0.1) and creates a modified frame index set S_j^E. The Voicing Thresholding Stage removes from S_j^E fundamental frequency estimates with corresponding voicing probabilities less than a specified voicing probability threshold and creates a modified frame index set S_j^EV. An example value of the voicing probability threshold is 0.5. The Statistical Processing Stage computes the median and mode of {f₀[i], ∀i ∈ S_j^EV} and classifies {f₀[i], ∀i ∈ S_j^EV} into one or more distributional types with a corresponding confidence estimate for the classification decision. Distributional types may be determined through clustering as described in K. Fukunaga, “Statistical Pattern Recognition,” 2nd Ed., Academic Press, 1990, p. 510. In a preferred embodiment, the distributional types are flat, rising, falling, and vibrato; however, many more distributional types are possible. Also in a preferred embodiment of the invention, the class decisions are made by choosing the class with the minimum squared error between the class template vector and the fundamental frequency vector with elements {f₀[i], ∀i ∈ S_j^EV}. The mode is computed in frequency bins corresponding to quarter tones of the chromatic scale. The Pitch Quantization Stage accepts as input the median, mode, distributional type, and class confidence estimate and assigns a MIDI note number to the note. A given fundamental frequency in Hz is converted to a MIDI note number according to the formula

$$m = m_A + 12 \log_2 \left( \frac{f_0}{f_A} \right) \qquad (7)$$

[0084] where m_A = 69 and f_A = 440 Hz. In the preferred embodiment, MIDI note numbers are assigned as follows. For flat distributions with high confidence, the MIDI note number is the nearest MIDI note integer to the mode. For rising and falling distributions, the MIDI note number is the nearest MIDI note integer to the median if the note duration is less than 7 frames, and the nearest MIDI note integer to the mode otherwise. For vibrato distributions, the MIDI note number is the nearest MIDI note integer to the mode.
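
Equation 7 is easily checked numerically; in the sketch below, the function name is hypothetical, and the clipping to the 32-83 range reflects the score data structure described above:

```python
import math

def hz_to_midi(f0, m_a=69, f_a=440.0):
    """Equation 7: convert a fundamental frequency in Hz to a
    real-valued MIDI note number; A4 = 440 Hz maps to 69."""
    return m_a + 12 * math.log2(f0 / f_a)

# Example: 261.63 Hz (middle C) -> 60.0; the assigned pitch is the
# nearest integer, clipped to the 32-83 range used by the score.
m = round(hz_to_midi(261.63))
print(max(32, min(83, m)))   # 60
```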

[0085] FIG. 6 shows the operation of the internote pitch assignment subsystem 180. The Internote Pitch Assignment Subsystem consists of two processing stages: the Key Finding Stage 207 and the Pairwise Correction Stage 206. The Key Finding Stage assigns the complete note sequence a scale in the Ionian or Aeolian mode, based on the distribution of Tonic, Mediant and Dominant pitch relationships that occur in the sequence. A scale is created for each chromatic pitch class, that is, for C, C#, D, D#, E, F, F#, G, G#, A, A# and B. Each pitch class is also assigned a probability weighted according to scale degree. For example, the first, sixth, eighth and tenth scale degrees are given negative weights, and the zeroth (the tonic), second, fourth, fifth, seventh and ninth scale degrees are given positive weights. The zeroth, fourth and seventh scale degrees are given additional weight because they form the tonic triad in a major scale.

[0086] The note sequence is compared to the scale with the highest probability as a template, and a degree of fit is calculated. In the preferred implementation, the measure of fit is calculated by scoring pitch occurrences of Tonic, Mediant and Dominant pitch functions as interpreted by each scale. The scale with the highest number of Tonic, Mediant and Dominant occurrences will have the highest score. The comparison may lead to a change of the MIDI note numbers of notes in the score that produce undesired differences. The differences are calculated in the Pairwise Correction Stage.
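
A heavily simplified sketch of the key-finding idea: score the note sequence against each of the twelve chromatic transpositions of a weight template and keep the best. Only the sign pattern of the weights follows the text; the magnitudes and the helper name are illustrative assumptions:

```python
# Scale-degree weights in semitones above the tonic. The sign pattern
# follows the text: degrees 1, 6, 8 and 10 negative; 0, 2, 4, 5, 7, 9
# positive, with extra weight on the tonic triad (0, 4, 7). Degrees 3
# and 11 default to zero; magnitudes are illustrative assumptions.
WEIGHTS = {0: 2, 2: 1, 4: 2, 5: 1, 7: 2, 9: 1,
           1: -1, 6: -1, 8: -1, 10: -1}

def best_key(midi_notes):
    """Return the tonic pitch class (0 = C ... 11 = B) whose scale
    template scores highest against the note sequence."""
    def score(tonic):
        return sum(WEIGHTS.get((n - tonic) % 12, 0) for n in midi_notes)
    return max(range(12), key=score)

# A C-major scale should pick C (pitch class 0).
print(best_key([60, 62, 64, 65, 67, 69, 71, 72]))   # 0
```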

[0087] In the Pairwise Correction Stage, MIDI note numbers that do not fit the scale template are first examined. A rules-based decision tree is used to evaluate a pair of pitches: the nonconforming pitch and the pitch that precedes it. Such rule-based decision trees, based on Species Counterpoint voice-leading rules, are well known in the art and are described, for example, in D. Temperley, “The Cognition of Basic Musical Structures,” The MIT Press, Cambridge, Mass., 2001, pp. 173-182. The rules are then used to evaluate the pair of notes consisting of the nonconforming pitch and the pitch that follows it. If both pairs conform to the rules, the nonconforming pitch is left unaltered. If the pairs do not conform to the rules, the nonconforming pitch is modified to fit within the assigned scale.

[0088] The corrected sequence is again examined to identify pairs that may not conform to the voice-leading rules. Pairs that do not conform are labeled dissonant and may be corrected. They are corrected if adjusting one note in the pair does not cause a dissonance (dissonance is defined by standard Species Counterpoint rules) in an adjacent pair either preceding or following the dissonant pair.

[0089] Each pair is then compared to the frequency ratios derived during the Pitch Quantization Stage. If a pair can be adjusted to more accurately reflect the ratio expressed by the pairs of frequencies, it is adjusted accordingly. In the preferred implementation, the adjustment is performed by raising or lowering a pitch from a pair if doing so does not cause a dissonance in an adjacent pair.

[0090] Computer System Implementation

[0091] FIG. 7 depicts a computer system 400 incorporating recording and note generation functions in place of the call handling and SMS handling, respectively, shown in FIG. 1. This is another preferred embodiment of the present invention. The computer system includes a central processing unit (CPU) 402, a user interface 404 (e.g., a standard computer interface with a monitor, keyboard and mouse or similar pointing device), an audio signal interface 406, a network interface 408 or similar communications interface for transmitting and receiving signals to and from other computer systems, and memory 410 (which will typically include both volatile random access memory and non-volatile memory such as disk or flash memory).

[0092] The audio signal interface 406 includes a microphone 412, a low pass filter 414 and an analog to digital converter (ADC) 416 for receiving and preprocessing analog input signals. It also includes a speaker driver 418 (which includes a digital to analog signal converter and signal shaping circuitry commonly found in “computer sound boards”) and an audio speaker 420.

[0093] The memory 410 stores an operating system 430, application programs 50, and the previously described signal processing modules. The other modules stored in the memory 410 have already been described above and are labeled with the same reference numbers as in the other figures.

[0094] Alternate Embodiments

[0095] While the present invention has been described with reference to a few specific embodiments, the description is illustrative of the invention and is not to be construed as limiting the invention. Various modifications may occur to those skilled in the art without departing from the true spirit and scope of the invention as defined by the appended claims.

[0096] For instance, the present invention could be embedded in a communication device, a stand-alone game device or the like. Further, the input signal could be a live voice, an acoustic instrument, a prerecorded sound signal, or a synthetic source.

[0097] It is to be understood that the above-described embodiments are simply illustrative of the principles of the invention. Various and other modifications and changes may be made by those skilled in the art which will embody the principles of the invention and fall within the spirit and scope thereof.

What is claimed is:
1. A method for generating an identification signal, comprising the steps of: accepting as input a monophonic audio signal of limited duration; translating said monophonic audio signal to a representation of a series of discrete tones; and producing a control signal from said representation of discrete tones, said control signal suitable for causing a transponder to generate a signal, where said generated signal is human-recognizable as a translation of said monophonic audio signal.
2. A method for generating an identification signal, comprising the steps of: accepting as input a voice signal of limited duration; translating said voice signal to a representation of a series of discrete tones; and producing a control signal from said representation of discrete tones, said control signal suitable for causing a transponder to generate a signal, where said generated signal is human-recognizable as a translation of said voice signal.
3. The method of claim 2 wherein said generated signal is melodically human-recognizable.
4. The method of claim 2 wherein said generated signal is rhythmically human-recognizable.
5. The method of claim 2 wherein accepting as input further comprises receiving said voice signal over a telephone connection.
6. The method of claim 5 wherein said telephone connection is wireless.
7. The method of claim 2 wherein said step of accepting as input further comprises receiving said voice signal over a microphone attached to a computer.
8. The method of claim 2 wherein said translating step further comprises translating said voice signal to a range of tones within the capability of a mobile telephone audio output synthesizer.
9. The method of claim 2 further comprising the step of transmitting said control signal to a tone-producing output device responsive to said control signal.
10. The method of claim 2 wherein said translating step further comprises the steps of: generating a digital representation of said voice signal; dividing said digitized signal into a plurality of frames; extracting analysis data from each said frame; and formatting said analysis data into a frame representation.
11. The method of claim 10 wherein said frame representation further comprises a plurality of signal parameters including a time-domain energy measure, a fundamental frequency value, cepstral coefficients, and a cepstral-domain energy measure.
12. The method of claim 11 further comprising the step of determining said time-domain energy measure by multiplying the signal in a selected frame, with its mean removed, by a window function, summing the square of the result, and normalizing the summed square by the number of samples in said selected frame.
13. The method of claim 12 wherein said window function is a unimodal window function.
14. The method of claim 11 further comprising the step of determining a fundamental frequency of a selected frame by determining the lowest significant periodic component of the signal of said selected frame.
15. The method of claim 11 further comprising the step of determining cepstral coefficients of a selected frame by computing the inverse discrete Fourier transform of the complex natural logarithm of the short-time discrete Fourier transform of the signal of said selected frame, said signal windowed by a window function.
16. The method of claim 11 further comprising the step of determining said cepstral-domain energy measure by determining a short-time cepstral gain with the mean value removed, said short-time cepstral gain normalized by the maximum gain over all frames.
17. The method of claim 11 further comprising the step of determining short-term averages of said plurality of signal parameters.
18. The method of claim 17 further comprising the step of determining each said short-term average over three consecutive frames.
19. The method of claim 17 further comprising the step of creating ordinal vectors encoding the number of frames in which directionality of change as determined by said short-term averages remains the same.
20. The method of claim 19 wherein said ordinal vectors further comprise a count of consecutive upward short-term average change in cepstral-domain energy, a count of consecutive downward short-term average change in cepstral-domain energy, a count of consecutive upward short-term average change in fundamental frequency, and a count of consecutive downward short-term average change in fundamental frequency.
21. The method of claim 20 further comprising the step of determining each count for each frame in said signal.
22. The method of claim 10 further comprising the step of segmenting said signal by counting instances of increased signal amplitude in said frames, and for each instance of increased amplitude, determining a change in each of pitch, energy, and spectral composition in a region around said instance of increased amplitude, whereby a segment is defined by a start frame having an instance of increased amplitude and an end frame is defined by changes in pitch, energy and spectral composition in relation to selected thresholds.
23. The method of claim 10 wherein said translating step further comprises grouping said frames into a plurality of regions.
24. The method of claim 23 wherein each said region is determined from a count of consecutive upward short-term average change in cepstral-domain energy followed by a count of consecutive downward short-term average change in cepstral-domain energy.
25. The method of claim 23 further comprising the step of determining the existence of a candidate note start frame in each said region.
26. The method of claim 24 further comprising the step of determining a candidate note start frame in each said region as the last frame within said region in which the count of consecutive upward short-term average change in cepstral-domain energy is not zero.
27. The method of claim 25 further comprising the step of determining which regions of said plurality have a valid note start frame.
28. The method of claim 25, wherein determining a candidate note start frame further comprises the step of determining if the cepstral domain energy of a particular frame is greater than a cepstral domain energy threshold and a frame immediately before said particular frame was below said cepstral domain energy threshold.
29. The method of claim 25, wherein determining a candidate note start frame further comprises the step of determining whether a fundamental frequency range of a particular frame is above a fundamental frequency range threshold and whether an energy range for said particular frame is above an energy range threshold.
30. The method of claim 25, further comprising the step of determining a stop frame corresponding to each start frame.
31. The method of claim 26, further comprising the step of determining a stop frame by locating the first frame after a start frame in which cepstral energy is below said cepstral domain energy threshold.
32. The method of claim 31, further comprising the step of defining the stop frame as a frame between two and ten frames before a subsequent start frame if no frame having cepstral energy below said cepstral domain energy threshold is found.
33. The method of claim 30 further comprising the step of verifying each start and stop frame pair by determining whether a) average voicing probability is above a voicing probability threshold, b) average short-time energy is above an average short-time energy threshold, and c) average fundamental frequency is above an average fundamental frequency threshold.
34. The method of claim 30 further comprising the steps of: forming an initial set of fundamental frequencies from said start and corresponding stop frames; removing from said initial set those fundamental frequencies having corresponding time-domain energies less than an energy threshold to form a modified set of fundamental frequencies; removing from said modified set those fundamental frequencies having corresponding voicing probabilities less than a voicing probability threshold to form a twice modified set of fundamental frequencies; determining a median for each member of said twice modified set; determining a mode for each member of said twice modified set; determining a distributional type for each member of said twice modified set with an associated class confidence estimate; and assigning a MIDI note number to each member of said twice modified set in response to said mode, said median, said distributional type and said class confidence estimate, whereby a note sequence is created.
35. The method of claim 34, further comprising the steps of: creating a plurality of scales, one for each chromatic pitch class in said note sequence; assigning a probability to each pitch class, said probability weighted according to scale degree of each note; comparing each of said plurality of scales to said note sequence to find a best fit scale based on occurrences of Tonic, Mediant, and Dominant of a particular scale in comparison to the note sequence; and selecting the scale with the highest degree of matching.
36. The method of claim 35 wherein said step of assigning probability further comprises: assigning negative probability weights to the first, sixth, eighth, and tenth scale degrees and positive probability weights to the zeroth, second, fourth, fifth, seventh, and ninth scale degrees.
37. The method of claim 36 wherein assigning positive probability further comprises the step of assigning additional positive probability weight to the zeroth, fourth, and seventh scale degrees.
38. The method of claim 35 wherein said comparing step further comprises: ranking said plurality of scales in order of probability; and comparing each of said plurality of scales with said note sequence in order of probability.
39. The method of claim 35, further comprising the steps of: examining a first pitch pair having a first note having a non-conforming pitch and a second note preceding the first; and, if said pitch pair does not conform to voice leading rules, adjusting said first note unless said adjustment causes dissonance in an adjacent pitch pair.
40. Apparatus for generating an identification signal comprising: a voice signal receiver; and a translator having as its input a voice signal received by said voice signal receiver and having as its output a representation of discrete tones, where an audio presentation of said discrete tones would be human-recognizable as a translation of said voice signal.
41. The apparatus of claim 40 wherein said voice signal receiver comprises an analog telephone receiver.
42. The apparatus of claim 40 wherein said voice signal receiver further comprises a voice-to-digital signal transducer.
43. The apparatus of claim 40 wherein said voice signal receiver further comprises a recording device.
44. The apparatus of claim 40 wherein said translator further comprises a feature estimation module to determine values for at least one time-varying feature of said input signal.
45. The apparatus of claim 44 wherein said translator further comprises a segmentation module responsive to output of said feature estimation module and energy of said input to segment said input signal into notes, and a pitch assignment module responsive to signal energy in each segment output by said segmentation module.
46. The apparatus of claim 44 wherein said feature estimation module further comprises a primary feature module, a secondary feature module and a tertiary feature module.
47. The apparatus of claim 46 wherein said primary feature module determines a plurality of values for each of time-domain energy, fundamental frequency, cepstral-domain energy, and voicing probability.
48. The apparatus of claim 46 wherein said secondary feature module determines a plurality of values for each of the secondary features of short-term average change in energy, short-term average change in fundamental frequency, short-term average change in cepstral coefficient, and short-term average change in cepstral-domain energy.
49. The apparatus of claim 48 wherein each said secondary value is computed over three consecutive frames of said input signal.
50. The apparatus of claim 48 wherein said tertiary feature module determines a plurality of values for at least one of said secondary features.
51. The apparatus of claim 45 wherein said segmentation module further comprises a first-phase segmentation module and a second-phase segmentation module.
52. The apparatus of claim 51 wherein said first-phase segmentation module groups a plurality of successive frames of said input signal into at least one region in response to output of said feature estimation module.
53. The apparatus of claim 52 wherein said region is a plurality of frames in which a change in energy increases immediately followed by frames in which change in energy decreases.
54. The apparatus of claim 53 in which said region has a minimum number of frames.
55. The apparatus of claim 52 wherein said second-phase segmentation module determines if said at least one region has a valid note start frame and if so, determines a stop frame.
56. The apparatus of claim 55 wherein said second-phase segmentation module determines said valid note start frame in response to cepstral domain energy by determining whether a frame has a cepstral domain energy greater than a cepstral domain energy threshold preceded by a frame having a cepstral domain energy less than said cepstral domain energy threshold.
57. The apparatus of claim 55 wherein said second-phase segmentation module determines a valid note start frame if the fundamental frequency range exceeds a fundamental frequency range threshold and if the energy range exceeds an energy range threshold.
58. The apparatus of claim 52 further comprising a segmentation post-processor to verify said start and stop frames in response to average voicing probability, average short-time energy, and average fundamental frequency of said start and stop frames.
59. The apparatus of claim 45 wherein said pitch assignment module assigns an integer between 32 and 83, said integer corresponding to the MIDI note number for pitch.
60. The apparatus of claim 45 wherein said pitch assignment module comprises an intranote pitch assignment subsystem and an internote pitch assignment subsystem.
61. The apparatus of claim 60 wherein said intranote pitch assignment subsystem determines pitch in response to time-domain energy, voicing probability, median, and mode of each said segment output by said segmentation module.
62. The apparatus of claim 61 wherein said intranote pitch assignment subsystem further comprises an energy thresholding stage to remove from a set of fundamental frequencies for a particular segment those fundamental frequencies whose corresponding time-domain energies are less than an energy threshold to produce a modified set of fundamental frequencies for said particular segment.
63. The apparatus of claim 62 wherein said intranote pitch assignment system further comprises a voicing thresholding stage to remove fundamental frequencies from said modified set whose corresponding voicing probabilities are less than a voicing probability threshold to produce a twice-modified set of fundamental frequencies for said particular segment.
64. The apparatus of claim 63 wherein said intranote pitch assignment system further comprises a statistical processing stage to compute a median and a mode for said twice-modified fundamental frequency set and to classify said segment as a distributional type in response to said median and said mode.
65. The apparatus of claim 64 wherein said segment is classified as a plurality of distributional types.
66. The apparatus of claim 64 wherein said intranote pitch assignment system further comprises a pitch quantization stage to assign a MIDI note number to said particular segment in response to said median, said mode and said distributional type.
67. The apparatus of claim 66 wherein said statistical processing stage further determines a decision confidence estimate corresponding to the determination of said distributional type, and said pitch quantization stage includes said confidence estimate in the assignment of said MIDI note number.
68. The apparatus of claim 60 wherein said internote pitch assignment subsystem corrects pitches determined by said intranote pitch assignment subsystem.
69. The apparatus of claim 68 wherein said internote pitch assignment subsystem further comprises a key finding stage to assign a scale to a note sequence output by said intranote pitch assignment subsystem.
70. The apparatus of claim 68 wherein said internote pitch assignment subsystem further comprises a pairwise correction stage to examine a pitch and its preceding pitch for conformity to voice-leading rules; if a pair is determined to be dissonant according to said voice-leading rules, the internote pitch assignment subsystem corrects the pitches of said pair if the pitch adjustment does not cause dissonance in an adjacent pair.